# Helper packages
library(dplyr) # for data manipulation
library(ggplot2) # for awesome graphics
library(gt) # for display tables
Data Partitioning and Feature Engineering
1 Importing Data and Packages
The dplyr package is used for data manipulation and the ggplot2 package is used for data visualization. Both are widely used in machine-learning workflows and are well documented on their websites.
1.1 Modeling Process packages
library(rsample) # for resampling procedures
library(caret) # for resampling and model training
library(h2o) # for resampling and model training
1.1.1 The h2o library
H2O is a fully open-source, distributed in-memory machine-learning platform with linear scalability. H2O supports the most widely used statistical & machine learning algorithms including gradient-boosted machines, generalized linear models, deep learning and more. H2O also has an industry-leading AutoML functionality that automatically runs through all the algorithms and their hyperparameters to produce a leaderboard of the best models. The H2O platform is used by over 18,000 organizations globally and is extremely popular in both the R & Python communities.
# h2o set up
h2o.no_progress() # turn off h2o progress bars
h2o.init() # launch h2o
H2O is not running yet, starting it now...
Note: In case of errors look at the following log files:
/tmp/RtmpAUmhtW/file3e9c382a9e8/h2o_runner_started_from_r.out
/tmp/RtmpAUmhtW/file3e9c73cda24/h2o_runner_started_from_r.err
Starting H2O JVM and connecting: .... Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 1 seconds 642 milliseconds
H2O cluster timezone: UTC
H2O data parsing timezone: UTC
H2O cluster version: 3.44.0.3
H2O cluster version age: 1 year, 10 months and 15 days
H2O cluster name: H2O_started_from_R_runner_mtg758
H2O cluster total nodes: 1
H2O cluster total memory: 3.91 GB
H2O cluster total cores: 4
H2O cluster allowed cores: 4
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
R Version: R version 4.5.0 (2025-04-11)
Warning in h2o.clusterInfo():
Your H2O cluster version is (1 year, 10 months and 15 days) old. There may be a newer version available.
Please download and install the latest version from: https://h2o-release.s3.amazonaws.com/h2o/latest_stable.html
1.2 Loading and Importing Data
We first need to load the dataset from a CSV file. In a real-world scenario, this would typically be a database connection.
# Read your dataset
ames <- read.csv("../data/AmesHousing2.csv")
# ames <- AmesHousing::make_ames()
We need to convert the data frame into an H2O object so the H2O models can run on it.
ames.h2o0 <- as.h2o(ames)
head(ames$SalePrice) # response variable
[1] 171000 139000 159000 136000 230000 190000
target <- 'SalePrice'
predictors <- setdiff(colnames(ames), target)
Let's see what the data looks like by examining the first few rows of the dataset.
library(kableExtra)
Attaching package: 'kableExtra'
The following object is masked from 'package:dplyr':
group_rows
kable(head(ames), "html") %>% kable_styling("striped") %>% scroll_box(width = "100%")
| Order | Lot.Frontage | Lot.Area | Lot.Shape | Utilities | Lot.Config | Land.Slope | Neighborhood | Bldg.Type | House.Style | Overall.Qual | Overall.Cond | Year.Built | Year.Remod.Add | Foundation | Bsmt.Unf.SF | Total.Bsmt.SF | BaseLivArea | Central.Air | X1st.Flr.SF | X2nd.Flr.SF | Gr.Liv.Area | Full.Bath | Half.Bath | Bathrooms | Bedroom.AbvGr | Kitchen.AbvGr | Kitchen.Qual | TotRms.AbvGrd | Fireplaces | Garage.Type | Garage.Area | Wood.Deck.SF | Open.Porch.SF | Enclosed.Porch | X3Ssn.Porch | Screen.Porch | Mo.Sold | Yr.Sold | Sale.Condition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 117 | 80 | 9600 | Reg | AllPub | Inside | Gtl | NWAmes | 1Fam | 2Story | 6 | 6 | 1971 | 1971 | CBlock | 386 | 715 | 329 | Y | 930 | 715 | 1645 | 1 | 2 | 2 | 4 | 1 | TA | 7 | 0 | Attchd | 441 | 0 | 78 | 0 | 0 | 0 | 6 | 2010 | Normal | 171000 |
| 325 | 94 | 9400 | Reg | AllPub | Corner | Gtl | Mitchel | Duplex | 2Story | 6 | 5 | 1971 | 1971 | CBlock | 912 | 912 | 0 | Y | 912 | 912 | 1824 | 2 | 2 | 3 | 4 | 2 | TA | 8 | 0 | NA | 0 | 128 | 0 | 0 | 0 | 0 | 4 | 2010 | Normal | 139000 |
| 337 | 70 | 7700 | Reg | AllPub | Inside | Gtl | Mitchel | Duplex | 2Story | 5 | 2 | 1985 | 1986 | PConc | 1216 | 1216 | 0 | Y | 1216 | 1216 | 2432 | 4 | 2 | 5 | 4 | 2 | TA | 10 | 0 | Attchd | 616 | 200 | 0 | 0 | 0 | 0 | 2 | 2010 | Normal | 159000 |
| 393 | 60 | 9000 | Reg | AllPub | FR2 | Gtl | NAmes | Duplex | 2Story | 5 | 5 | 1974 | 1974 | CBlock | 896 | 896 | 0 | Y | 896 | 896 | 1792 | 2 | 2 | 3 | 4 | 2 | TA | 8 | 0 | NA | 0 | 32 | 45 | 0 | 0 | 0 | 9 | 2009 | Normal | 136000 |
| 590 | NA | 13774 | IR1 | AllPub | Inside | Gtl | NWAmes | 1Fam | 2Story | 7 | 7 | 1977 | 1992 | PConc | 476 | 908 | 432 | Y | 1316 | 972 | 2288 | 1 | 2 | 2 | 4 | 1 | Gd | 8 | 2 | Attchd | 520 | 321 | 72 | 0 | 0 | 156 | 11 | 2009 | Normal | 230000 |
| 667 | 81 | 9671 | Reg | AllPub | Corner | Gtl | NAmes | Duplex | 2Story | 6 | 5 | 1969 | 1969 | CBlock | 1248 | 1248 | 0 | Y | 1248 | 1296 | 2544 | 2 | 2 | 3 | 6 | 2 | TA | 12 | 0 | Attchd | 907 | 0 | 0 | 0 | 0 | 0 | 8 | 2009 | Normal | 190000 |
# head(ames)
Now let's look at a summary of the dataset.
library(gtsummary)
summary(ames)
Order Lot.Frontage Lot.Area Lot.Shape
Min. : 1.0 Min. : 21.00 Min. : 1300 Length:2930
1st Qu.: 733.2 1st Qu.: 58.00 1st Qu.: 7440 Class :character
Median :1465.5 Median : 68.00 Median : 9436 Mode :character
Mean :1465.5 Mean : 69.22 Mean : 10148
3rd Qu.:2197.8 3rd Qu.: 80.00 3rd Qu.: 11555
Max. :2930.0 Max. :313.00 Max. :215245
NA's :490
Utilities Lot.Config Land.Slope Neighborhood
Length:2930 Length:2930 Length:2930 Length:2930
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Bldg.Type House.Style Overall.Qual Overall.Cond
Length:2930 Length:2930 Min. : 1.000 Min. :1.000
Class :character Class :character 1st Qu.: 5.000 1st Qu.:5.000
Mode :character Mode :character Median : 6.000 Median :5.000
Mean : 6.095 Mean :5.563
3rd Qu.: 7.000 3rd Qu.:6.000
Max. :10.000 Max. :9.000
Year.Built Year.Remod.Add Foundation Bsmt.Unf.SF
Min. :1872 Min. :1950 Length:2930 Min. : 0.0
1st Qu.:1954 1st Qu.:1965 Class :character 1st Qu.: 219.0
Median :1973 Median :1993 Mode :character Median : 466.0
Mean :1971 Mean :1984 Mean : 559.3
3rd Qu.:2001 3rd Qu.:2004 3rd Qu.: 802.0
Max. :2010 Max. :2010 Max. :2336.0
NA's :1
Total.Bsmt.SF BaseLivArea Central.Air X1st.Flr.SF
Min. : 0 Min. : 0.0 Length:2930 Min. : 334.0
1st Qu.: 793 1st Qu.: 0.0 Class :character 1st Qu.: 876.2
Median : 990 Median : 459.5 Mode :character Median :1084.0
Mean :1052 Mean : 492.2 Mean :1159.6
3rd Qu.:1302 3rd Qu.: 808.0 3rd Qu.:1384.0
Max. :6110 Max. :5644.0 Max. :5095.0
NA's :1
X2nd.Flr.SF Gr.Liv.Area Full.Bath Half.Bath
Min. : 0.0 Min. : 334 Min. :0.000 Min. :0.0000
1st Qu.: 0.0 1st Qu.:1126 1st Qu.:1.000 1st Qu.:0.0000
Median : 0.0 Median :1442 Median :2.000 Median :0.0000
Mean : 335.5 Mean :1500 Mean :1.567 Mean :0.3795
3rd Qu.: 703.8 3rd Qu.:1743 3rd Qu.:2.000 3rd Qu.:1.0000
Max. :2065.0 Max. :5642 Max. :4.000 Max. :2.0000
Bathrooms Bedroom.AbvGr Kitchen.AbvGr Kitchen.Qual
Min. :0.000 Min. :0.000 Min. :0.000 Length:2930
1st Qu.:1.000 1st Qu.:2.000 1st Qu.:1.000 Class :character
Median :2.000 Median :3.000 Median :1.000 Mode :character
Mean :1.756 Mean :2.854 Mean :1.044
3rd Qu.:2.500 3rd Qu.:3.000 3rd Qu.:1.000
Max. :5.000 Max. :8.000 Max. :3.000
TotRms.AbvGrd Fireplaces Garage.Type Garage.Area
Min. : 2.000 Min. :0.0000 Length:2930 Min. : 0.0
1st Qu.: 5.000 1st Qu.:0.0000 Class :character 1st Qu.: 320.0
Median : 6.000 Median :1.0000 Mode :character Median : 480.0
Mean : 6.443 Mean :0.5993 Mean : 472.8
3rd Qu.: 7.000 3rd Qu.:1.0000 3rd Qu.: 576.0
Max. :15.000 Max. :4.0000 Max. :1488.0
NA's :1
Wood.Deck.SF Open.Porch.SF Enclosed.Porch X3Ssn.Porch
Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000
Median : 0.00 Median : 27.00 Median : 0.00 Median : 0.000
Mean : 93.75 Mean : 47.53 Mean : 23.01 Mean : 2.592
3rd Qu.: 168.00 3rd Qu.: 70.00 3rd Qu.: 0.00 3rd Qu.: 0.000
Max. :1424.00 Max. :742.00 Max. :1012.00 Max. :508.000
Screen.Porch Mo.Sold Yr.Sold Sale.Condition
Min. : 0 Min. : 1.000 Min. :2006 Length:2930
1st Qu.: 0 1st Qu.: 4.000 1st Qu.:2007 Class :character
Median : 0 Median : 6.000 Median :2008 Mode :character
Mean : 16 Mean : 6.216 Mean :2008
3rd Qu.: 0 3rd Qu.: 8.000 3rd Qu.:2009
Max. :576 Max. :12.000 Max. :2010
SalePrice
Min. : 12789
1st Qu.:129500
Median :160000
Mean :180796
3rd Qu.:213500
Max. :755000
Let's produce a better-formatted table for the summary.
ames %>%
gtsummary::tbl_summary(
) %>%
gtsummary::bold_labels() %>%
gtsummary::as_kable_extra(
format = "html",
longtable = TRUE,
linesep = ""
) %>%
kableExtra::kable_styling(
position = "left",
latex_options = c("striped", "repeat_header"),
stripe_color = "gray!15"
)
| Characteristic | N = 2,930 |
|---|---|
| Order | 1,466 (733, 2,198) |
| Lot.Frontage | 68 (58, 80) |
| Unknown | 490 |
| Lot.Area | 9,437 (7,440, 11,556) |
| Lot.Shape | |
| IR1 | 979 (33%) |
| IR2 | 76 (2.6%) |
| IR3 | 16 (0.5%) |
| Reg | 1,859 (63%) |
| Utilities | |
| AllPub | 2,927 (100%) |
| NoSeWa | 1 (<0.1%) |
| NoSewr | 2 (<0.1%) |
| Lot.Config | |
| Corner | 511 (17%) |
| CulDSac | 180 (6.1%) |
| FR2 | 85 (2.9%) |
| FR3 | 14 (0.5%) |
| Inside | 2,140 (73%) |
| Land.Slope | |
| Gtl | 2,789 (95%) |
| Mod | 125 (4.3%) |
| Sev | 16 (0.5%) |
| Neighborhood | |
| Blmngtn | 28 (1.0%) |
| Blueste | 10 (0.3%) |
| BrDale | 30 (1.0%) |
| BrkSide | 108 (3.7%) |
| ClearCr | 44 (1.5%) |
| CollgCr | 267 (9.1%) |
| Crawfor | 103 (3.5%) |
| Edwards | 194 (6.6%) |
| Gilbert | 165 (5.6%) |
| Greens | 8 (0.3%) |
| GrnHill | 2 (<0.1%) |
| IDOTRR | 93 (3.2%) |
| Landmrk | 1 (<0.1%) |
| MeadowV | 37 (1.3%) |
| Mitchel | 114 (3.9%) |
| NAmes | 443 (15%) |
| NoRidge | 71 (2.4%) |
| NPkVill | 23 (0.8%) |
| NridgHt | 166 (5.7%) |
| NWAmes | 131 (4.5%) |
| OldTown | 239 (8.2%) |
| Sawyer | 151 (5.2%) |
| SawyerW | 125 (4.3%) |
| Somerst | 182 (6.2%) |
| StoneBr | 51 (1.7%) |
| SWISU | 48 (1.6%) |
| Timber | 72 (2.5%) |
| Veenker | 24 (0.8%) |
| Bldg.Type | |
| 1Fam | 2,425 (83%) |
| 2fmCon | 62 (2.1%) |
| Duplex | 109 (3.7%) |
| Twnhs | 101 (3.4%) |
| TwnhsE | 233 (8.0%) |
| House.Style | |
| 1.5Fin | 314 (11%) |
| 1.5Unf | 19 (0.6%) |
| 1Story | 1,481 (51%) |
| 2.5Fin | 8 (0.3%) |
| 2.5Unf | 24 (0.8%) |
| 2Story | 873 (30%) |
| SFoyer | 83 (2.8%) |
| SLvl | 128 (4.4%) |
| Overall.Qual | 6 (5, 7) |
| Overall.Cond | |
| 1 | 7 (0.2%) |
| 2 | 10 (0.3%) |
| 3 | 50 (1.7%) |
| 4 | 101 (3.4%) |
| 5 | 1,654 (56%) |
| 6 | 533 (18%) |
| 7 | 390 (13%) |
| 8 | 144 (4.9%) |
| 9 | 41 (1.4%) |
| Year.Built | 1,973 (1,954, 2,001) |
| Year.Remod.Add | 1,993 (1,965, 2,004) |
| Foundation | |
| BrkTil | 311 (11%) |
| CBlock | 1,244 (42%) |
| PConc | 1,310 (45%) |
| Slab | 49 (1.7%) |
| Stone | 11 (0.4%) |
| Wood | 5 (0.2%) |
| Bsmt.Unf.SF | 466 (219, 802) |
| Unknown | 1 |
| Total.Bsmt.SF | 990 (793, 1,302) |
| Unknown | 1 |
| BaseLivArea | 460 (0, 808) |
| Central.Air | |
| N | 196 (6.7%) |
| Y | 2,734 (93%) |
| X1st.Flr.SF | 1,084 (876, 1,384) |
| X2nd.Flr.SF | 0 (0, 704) |
| Gr.Liv.Area | 1,442 (1,126, 1,743) |
| Full.Bath | |
| 0 | 12 (0.4%) |
| 1 | 1,318 (45%) |
| 2 | 1,532 (52%) |
| 3 | 64 (2.2%) |
| 4 | 4 (0.1%) |
| Half.Bath | |
| 0 | 1,843 (63%) |
| 1 | 1,062 (36%) |
| 2 | 25 (0.9%) |
| Bathrooms | 2.00 (1.00, 2.50) |
| Bedroom.AbvGr | |
| 0 | 8 (0.3%) |
| 1 | 112 (3.8%) |
| 2 | 743 (25%) |
| 3 | 1,597 (55%) |
| 4 | 400 (14%) |
| 5 | 48 (1.6%) |
| 6 | 21 (0.7%) |
| 8 | 1 (<0.1%) |
| Kitchen.AbvGr | |
| 0 | 3 (0.1%) |
| 1 | 2,796 (95%) |
| 2 | 129 (4.4%) |
| 3 | 2 (<0.1%) |
| Kitchen.Qual | |
| Ex | 205 (7.0%) |
| Fa | 70 (2.4%) |
| Gd | 1,160 (40%) |
| Po | 1 (<0.1%) |
| TA | 1,494 (51%) |
| TotRms.AbvGrd | 6 (5, 7) |
| Fireplaces | |
| 0 | 1,422 (49%) |
| 1 | 1,274 (43%) |
| 2 | 221 (7.5%) |
| 3 | 12 (0.4%) |
| 4 | 1 (<0.1%) |
| Garage.Type | |
| 2Types | 23 (0.8%) |
| Attchd | 1,731 (62%) |
| Basment | 36 (1.3%) |
| BuiltIn | 186 (6.7%) |
| CarPort | 15 (0.5%) |
| Detchd | 782 (28%) |
| Unknown | 157 |
| Garage.Area | 480 (320, 576) |
| Unknown | 1 |
| Wood.Deck.SF | 0 (0, 168) |
| Open.Porch.SF | 27 (0, 70) |
| Enclosed.Porch | 0 (0, 0) |
| X3Ssn.Porch | 0 (0, 0) |
| Screen.Porch | 0 (0, 0) |
| Mo.Sold | 6 (4, 8) |
| Yr.Sold | |
| 2006 | 625 (21%) |
| 2007 | 694 (24%) |
| 2008 | 622 (21%) |
| 2009 | 648 (22%) |
| 2010 | 341 (12%) |
| Sale.Condition | |
| Abnorml | 190 (6.5%) |
| AdjLand | 12 (0.4%) |
| Alloca | 24 (0.8%) |
| Family | 46 (1.6%) |
| Normal | 2,413 (82%) |
| Partial | 245 (8.4%) |
| SalePrice | 160,000 (129,500, 213,500) |
| 1 Median (Q1, Q3); n (%) |
1.3 Variable separation
The dataset contains both numeric and categorical variables. We need to separate them for further analysis. This is especially important for feature engineering.
num_vars <- colnames(ames)[sapply(ames, is.numeric)]
cat_vars <- colnames(ames)[sapply(ames, is.character)]
1.4 Side note on text and variable outputs
There are two ways to use values from the R code in the text. The first is to build a statement string and print() it; the second is to use inline r output.
1.4.1 Statement output
stmt <- paste("There are", length(num_vars), "numerical features and", length(cat_vars), "categorical features" )
print(stmt, na.print = NULL)
[1] "There are 29 numerical features and 12 categorical features"
1.4.2 r output
There are 29 numerical features and 12 categorical features in this dataset.
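For reference, the inline syntax that produced the sentence above (assuming the source is an R Markdown/Quarto document) looks like this:

```markdown
There are `r length(num_vars)` numerical features and
`r length(cat_vars)` categorical features in this dataset.
```

The expressions inside the backticks are evaluated at render time and replaced with their values.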
2 Resampling Methods
2.1 Test Set Approach
There are multiple ways to split your data in R (four are shown below); simple random sampling is one of them.
2.1.1 Using base R
set.seed(123) # for reproducibility
index_1 <- sample(1:nrow(ames), round(nrow(ames) * 0.8))
train_1 <- ames[index_1, ]
test_1 <- ames[-index_1, ]
# kable(head(train), "html") %>% kable_styling("striped") %>% scroll_box(width = "100%")
print(dim(train_1))
[1] 2344 41
print(dim(test_1))
[1] 586 41
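A quick sanity check (a sketch, assuming ames, train_1, and test_1 from the chunks above) confirms the two sets form a true partition of the data:

```r
# train and test row counts should sum to the full dataset
nrow(train_1) + nrow(test_1) == nrow(ames)                      # should be TRUE
# and no observation should appear in both sets
length(intersect(rownames(train_1), rownames(test_1))) == 0     # should be TRUE
```

Because subsetting a data frame preserves row names, the intersection of train and test row names must be empty if the split is correct.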
2.1.2 Using caret package
set.seed(123) # for reproducibility
index_2 <- createDataPartition(ames$SalePrice, p = 0.7,
list = FALSE)
train_2 <- ames[index_2, ]
test_2 <- ames[-index_2, ]
2.1.3 Using rsample package
set.seed(123) # for reproducibility
split_1 <- initial_split(ames, prop = 0.7)
train_3 <- training(split_1)
test_3 <- testing(split_1)
2.1.4 Using h2o package
split_2 <- h2o.splitFrame(ames.h2o0, ratios = 0.7,
seed = 123)
train_4 <- split_2[[1]]
test_4 <- split_2[[2]]
2.2 The Validation Set Approach
Draw a sample of size nrow(ames) from the numbers 1 to 3 (with replacement). You want approximately 70% of the sample to be 1 and the remaining 30% to be equally split between 2 and 3.
set.seed(123)
part <- sample(1:3, size = nrow(ames), prob = c(0.7, 0.15, 0.15), replace = TRUE)
We can also create training, validation, and testing sets from the original data frame directly.
train_5 <- ames[part == 1, ] # subset ames to training indices only
valid_5 <- ames[part == 2, ]
test_5 <- ames[part == 3, ]
2.3 Stratified Random Sampling
The easiest way to perform stratified sampling on a response variable is to use the rsample package, where you pass the response variable to the strata argument. The following illustrates that in our original employee attrition data we have an imbalanced response (No: 84%, Yes: 16%). By enforcing stratified sampling, both our training and testing sets have approximately equal response distributions.
Let’s import Job Attrition data from the rsample package.
# Job attrition data
library(rsample)
library(modeldata)
churn <- attrition %>%
mutate_if(is.ordered, .funs = factor, ordered = FALSE)
churn.h2o <- as.h2o(churn)
2.3.1 Original response distribution
t(table(churn$Attrition)) %>% prop.table() %>% round(4) %>% kbl(format = "html", table.attr = "style='width:30%;'") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center")
| No | Yes |
|---|---|
| 0.8388 | 0.1612 |
2.3.2 Stratified sampling with the rsample package
set.seed(123)
split_strat <- initial_split(churn, prop = 0.7,
strata = "Attrition")
train_strat <- training(split_strat)
test_strat <- testing(split_strat)
split <- initial_split(ames, prop = 0.7,
strata = "SalePrice")
ames_train <- training(split)
ames_test <- testing(split)
2.3.3 Consistent response ratio between train & test
library(dplyr)
library(kableExtra)
train_strat$Attrition %>%
table() %>%
t() %>%
prop.table() %>%
round(4) %>%
knitr::kable(format = "html", table.attr = "style='width:30%;'") %>%
kableExtra::kable_styling(
bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = F,
position = "center"
)
| No | Yes |
|---|---|
| 0.8395 | 0.1605 |
Now the test data frame:
t(table(test_strat$Attrition)) %>% prop.table() %>% round(4) %>% knitr::kable(format = "html", table.attr = "style='width:30%;'") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center")
| No | Yes |
|---|---|
| 0.8371 | 0.1629 |
3 Feature Engineering
Feature engineering is the process of transforming raw data into informative features that better represent the underlying patterns in the data, aiming to improve the performance of machine learning models. By leveraging domain knowledge, statistical techniques, and creative transformations, practitioners craft new features, select relevant ones, or refine existing variables to make algorithms more effective and efficient. This process often plays a crucial role in enhancing model accuracy, generalization, and interpretability (Heaton 2016).
3.1 Loading packages
library(dplyr) # for data manipulation
library(ggplot2) # for awesome graphics
library(visdat) # for additional visualizations
# Feature engineering packages
library(caret) # for various ML tasks
library(recipes) # for feature engineering tasks
3.2 Transformation
Let’s create a histogram of the SalePrice variable to understand the distribution of the variable.
hist(ames$SalePrice, col = 'blue')
It seems like the variable is skewed. We can apply a transformation to bring the data closer to the normality assumption. Let's also calculate the moments of the variable to quantify how skewed the data is. Also, this plot looks ugly; can we make it look better?
library(ggplot2)
library(scales)
# color = color of the histogram border
ggplot(ames, aes(x = SalePrice)) +
geom_histogram(aes(y = after_stat(density)), fill = "#336699", color = "black", alpha = 0.65, bins = 25) +
labs(x = "Sale Price", y = "Density", title="Histogram of Sale Price") +
scale_x_continuous(labels = label_comma(accuracy = 1)) +
# scale_y_continuous(labels = label_comma(accuracy = 1)) +
theme_light() +
theme(
plot.title = element_text(hjust = 0, face = "bold", size = 14), # Left-align (hjust = 0) and bold the title
axis.title.x = element_text(size = 10, color = "black", face="bold"),
axis.title.y = element_text(size = 10, color = "black", face="bold"),
axis.text.x = element_text(size = 8, color = "black"),
axis.text.y = element_text(size = 8, color = "black")
)
# install.packages("moments")
library(moments)
saleprice_skew <- skewness(ames$SalePrice)
Skewness is 1.7426074 (>1), which indicates the data is positively (right) skewed.
3.3 Variable Transformation
3.3.1 Log Transformation
Transform the variable to a log transformation (natural log of variable)
ln_sales <- log(ames$SalePrice)
hist(ln_sales, col = 'blue', breaks = 40, main = "Histogram of Log Sale Price", xlab = "Log Sale Price", ylab = "Frequency")
log_sales = skewness(ln_sales)
The distribution now appears approximately normal. The skewness value is -0.0147859, so the transformation has made the distribution very slightly negatively skewed, but the skewness is now much closer to 0.
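One caveat worth keeping in mind: anything computed on the log scale must be back-transformed with exp() before being interpreted as a price. A quick sketch (assuming ln_sales and ames from above):

```r
# the log/exp round trip recovers the original prices
all.equal(exp(ln_sales), ames$SalePrice)  # TRUE up to floating-point tolerance

# note: exponentiating the mean of the logs gives the geometric mean,
# which is generally smaller than the arithmetic mean of SalePrice
exp(mean(ln_sales)) <= mean(ames$SalePrice)
```

This matters later when models are fit to the log-transformed response: their predictions live on the log scale until back-transformed.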
3.3.2 Square Root Transformation
sqt_sales <- sqrt(ames$SalePrice)
hist(sqt_sales, col = 'blue')
sq_skewness <- skewness(sqt_sales)
The distribution is not as close to normal as with the log transformation. The skewness value is 0.8843168, meaning the data is still moderately positively skewed; the square-root transformation performs worse than the log transformation.
3.3.3 Cube-Root Transformation
cbt_sales <- (ames$SalePrice)^(1/3)
hist(cbt_sales, col = 'blue')
cbt_skewness <- skewness(cbt_sales)
The distribution is worse than with the log transformation but better than with the square root. The skewness value is 0.6069153, meaning the data is still positively skewed; the cube root performs better than the square root (though you may prefer the log transformation).
3.3.4 Box-Cox Transformation
# | message: false
# load package
library(sjPlot)
Attaching package: 'sjPlot'
The following object is masked from 'package:ggplot2':
set_theme
library(sjmisc)
library(sjlabelled)
Attaching package: 'sjlabelled'
The following object is masked from 'package:ggplot2':
as_label
The following object is masked from 'package:dplyr':
as_label
m.ols <- lm(SalePrice ~ Gr.Liv.Area, data=ames)
ols_summ <- summary(m.ols)
tab_model(m.ols)
| Sale Price | | | |
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 13289.63 | 6878.48 – 19700.78 | <0.001 |
| Gr Liv Area | 111.69 | 107.64 – 115.75 | <0.001 |
| Observations | 2930 | ||
| R2 / R2 adjusted | 0.500 / 0.499 | ||
library(MASS)
Attaching package: 'MASS'
The following object is masked from 'package:gtsummary':
select
The following object is masked from 'package:dplyr':
select
bc <- boxcox(m.ols)
str(bc)
List of 2
$ x: num [1:100] -2 -1.96 -1.92 -1.88 -1.84 ...
$ y: num [1:100] -14038 -13806 -13577 -13351 -13127 ...
(bc.power <- bc$x[which.max(bc$y)])
[1] 0.1818182
\begin{align} y_{transform} = \frac{y^{\lambda}-1}{\lambda} \end{align}
We need to transform the response variable accordingly and re-compute the linear model with this transformed variable. We can write a function corresponding to the formula:
BCTransform <- function(y, lambda=0) {
if (lambda == 0L) { log(y) }
else { (y^lambda - 1) / lambda }
}
#Reverse Transformation
BCTransformInverse <- function(yt, lambda=0) {
if (lambda == 0L) { exp(yt) }
else { exp(log(1 + lambda * yt)/lambda) }
}
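A quick round-trip check of the two helpers above (a sketch with arbitrary example values): the inverse should recover the original values in both the lambda != 0 and lambda == 0 (log) cases.

```r
y <- c(100, 1000, 10000)
# lambda != 0 case: (1 + lambda * yt)^(1/lambda) undoes (y^lambda - 1)/lambda
all.equal(BCTransformInverse(BCTransform(y, 0.18), 0.18), y)  # TRUE
# lambda == 0 case: exp() undoes log()
all.equal(BCTransformInverse(BCTransform(y, 0), 0), y)        # TRUE
```

This is worth verifying before using the inverse to convert model predictions back to the original dollar scale.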
ames$SalePrice.bc <- BCTransform(ames$SalePrice, bc.power)
hist(ames$SalePrice.bc, breaks = 30); rug(ames$SalePrice.bc)
SPbc_skewness <- skewness(ames$SalePrice.bc)
summary(m.ols.bc <- lm(SalePrice.bc ~ Gr.Liv.Area, data=ames))
Call:
lm(formula = SalePrice.bc ~ Gr.Liv.Area, data = ames)
Residuals:
Min 1Q Median 3Q Max
-21.5363 -1.3894 0.1928 1.4254 8.5499
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.594e+01 1.497e-01 240.14 <2e-16 ***
Gr.Liv.Area 5.085e-03 9.456e-05 53.78 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.587 on 2928 degrees of freedom
Multiple R-squared: 0.4969, Adjusted R-squared: 0.4967
F-statistic: 2892 on 1 and 2928 DF, p-value: < 2.2e-16
3.4 Feature Scaling
Let's look at one of the continuous variables, Living Area. We want to scale the feature. In general, feature scaling refers to the process of standardizing the range of data features. Since raw feature values vary widely in range, the objective functions of some machine learning algorithms will not work properly without normalization.
We can perform Min-Max scaling or Z-score scaling. Min-Max Scaling transforms the data to a range between 0 and 1. Z-score scaling transforms the data into a mean of 0 and a standard deviation of 1.
The formula for Min-Max scaling is:
\begin{align} x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}} \end{align}
The formula for Z-score scaling is:
\begin{align} z = \frac{x - \mu}{\sigma} \end{align}
where x is the original feature vector, \mu is the mean of that feature vector, and \sigma is its standard deviation.
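Both formulas can also be written compactly in base R; a sketch (assuming ames is loaded, and noting that the minmax helper name is our own):

```r
# z-score via base R: scale() centers on the mean and divides by the sd,
# matching the formula above exactly
z <- scale(ames$Gr.Liv.Area)

# min-max as a reusable helper; na.rm guards against missing values
minmax <- function(x) (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
g_scaled <- minmax(ames$Gr.Liv.Area)  # values lie in [0, 1]
```

The manual versions below make the arithmetic explicit, which is useful for seeing what each scaler actually does.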
G <- ames$Gr.Liv.Area
A <- ames$Lot.Area
max_A <- max(A)
min_A <- min(A)
max_G <- max(G)
min_G <- min(G)
The range of Gr.Liv.Area is 334 to 5642. The range of Lot.Area is 1300 to 215245.
3.4.1 Min-Max Scaling
Gr.Liv.Area_N <- (G - min_G) / (max_G - min_G)
Lot.Area_N <- (A - min_A) / (max_A - min_A)
minmaxscaler <- cbind.data.frame(ames, Gr.Liv.Area_N, Lot.Area_N)
head(minmaxscaler)
 Order Lot.Frontage Lot.Area Lot.Shape Utilities Lot.Config Land.Slope
1 117 80 9600 Reg AllPub Inside Gtl
2 325 94 9400 Reg AllPub Corner Gtl
3 337 70 7700 Reg AllPub Inside Gtl
4 393 60 9000 Reg AllPub FR2 Gtl
5 590 NA 13774 IR1 AllPub Inside Gtl
6 667 81 9671 Reg AllPub Corner Gtl
Neighborhood Bldg.Type House.Style Overall.Qual Overall.Cond Year.Built
1 NWAmes 1Fam 2Story 6 6 1971
2 Mitchel Duplex 2Story 6 5 1971
3 Mitchel Duplex 2Story 5 2 1985
4 NAmes Duplex 2Story 5 5 1974
5 NWAmes 1Fam 2Story 7 7 1977
6 NAmes Duplex 2Story 6 5 1969
Year.Remod.Add Foundation Bsmt.Unf.SF Total.Bsmt.SF BaseLivArea Central.Air
1 1971 CBlock 386 715 329 Y
2 1971 CBlock 912 912 0 Y
3 1986 PConc 1216 1216 0 Y
4 1974 CBlock 896 896 0 Y
5 1992 PConc 476 908 432 Y
6 1969 CBlock 1248 1248 0 Y
X1st.Flr.SF X2nd.Flr.SF Gr.Liv.Area Full.Bath Half.Bath Bathrooms
1 930 715 1645 1 2 2
2 912 912 1824 2 2 3
3 1216 1216 2432 4 2 5
4 896 896 1792 2 2 3
5 1316 972 2288 1 2 2
6 1248 1296 2544 2 2 3
Bedroom.AbvGr Kitchen.AbvGr Kitchen.Qual TotRms.AbvGrd Fireplaces Garage.Type
1 4 1 TA 7 0 Attchd
2 4 2 TA 8 0 <NA>
3 4 2 TA 10 0 Attchd
4 4 2 TA 8 0 <NA>
5 4 1 Gd 8 2 Attchd
6 6 2 TA 12 0 Attchd
Garage.Area Wood.Deck.SF Open.Porch.SF Enclosed.Porch X3Ssn.Porch
1 441 0 78 0 0
2 0 128 0 0 0
3 616 200 0 0 0
4 0 32 45 0 0
5 520 321 72 0 0
6 907 0 0 0 0
Screen.Porch Mo.Sold Yr.Sold Sale.Condition SalePrice SalePrice.bc
1 0 6 2010 Normal 171000 43.68317
2 0 4 2010 Normal 139000 41.86486
3 0 2 2010 Normal 159000 43.03681
4 0 9 2009 Normal 136000 41.67733
5 156 11 2009 Normal 230000 46.40657
6 0 8 2009 Normal 190000 44.63443
Gr.Liv.Area_N Lot.Area_N
1 0.2469857 0.03879502
2 0.2807084 0.03786020
3 0.3952524 0.02991423
4 0.2746797 0.03599056
5 0.3681236 0.05830470
6 0.4163527 0.03912688
3.4.2 Standardize (z scores)
z_Gr.Liv.Area <- (ames$Gr.Liv.Area - mean(ames$Gr.Liv.Area)) / sd(ames$Gr.Liv.Area)
z_Lot.Area <- (ames$Lot.Area - mean(ames$Lot.Area)) / sd(ames$Lot.Area)
zscore <- cbind.data.frame(ames, z_Gr.Liv.Area, z_Lot.Area)
head(zscore)
 Order Lot.Frontage Lot.Area Lot.Shape Utilities Lot.Config Land.Slope
1 117 80 9600 Reg AllPub Inside Gtl
2 325 94 9400 Reg AllPub Corner Gtl
3 337 70 7700 Reg AllPub Inside Gtl
4 393 60 9000 Reg AllPub FR2 Gtl
5 590 NA 13774 IR1 AllPub Inside Gtl
6 667 81 9671 Reg AllPub Corner Gtl
Neighborhood Bldg.Type House.Style Overall.Qual Overall.Cond Year.Built
1 NWAmes 1Fam 2Story 6 6 1971
2 Mitchel Duplex 2Story 6 5 1971
3 Mitchel Duplex 2Story 5 2 1985
4 NAmes Duplex 2Story 5 5 1974
5 NWAmes 1Fam 2Story 7 7 1977
6 NAmes Duplex 2Story 6 5 1969
Year.Remod.Add Foundation Bsmt.Unf.SF Total.Bsmt.SF BaseLivArea Central.Air
1 1971 CBlock 386 715 329 Y
2 1971 CBlock 912 912 0 Y
3 1986 PConc 1216 1216 0 Y
4 1974 CBlock 896 896 0 Y
5 1992 PConc 476 908 432 Y
6 1969 CBlock 1248 1248 0 Y
X1st.Flr.SF X2nd.Flr.SF Gr.Liv.Area Full.Bath Half.Bath Bathrooms
1 930 715 1645 1 2 2
2 912 912 1824 2 2 3
3 1216 1216 2432 4 2 5
4 896 896 1792 2 2 3
5 1316 972 2288 1 2 2
6 1248 1296 2544 2 2 3
Bedroom.AbvGr Kitchen.AbvGr Kitchen.Qual TotRms.AbvGrd Fireplaces Garage.Type
1 4 1 TA 7 0 Attchd
2 4 2 TA 8 0 <NA>
3 4 2 TA 10 0 Attchd
4 4 2 TA 8 0 <NA>
5 4 1 Gd 8 2 Attchd
6 6 2 TA 12 0 Attchd
Garage.Area Wood.Deck.SF Open.Porch.SF Enclosed.Porch X3Ssn.Porch
1 441 0 78 0 0
2 0 128 0 0 0
3 616 200 0 0 0
4 0 32 45 0 0
5 520 321 72 0 0
6 907 0 0 0 0
Screen.Porch Mo.Sold Yr.Sold Sale.Condition SalePrice SalePrice.bc
1 0 6 2010 Normal 171000 43.68317
2 0 4 2010 Normal 139000 41.86486
3 0 2 2010 Normal 159000 43.03681
4 0 9 2009 Normal 136000 41.67733
5 156 11 2009 Normal 230000 46.40657
6 0 8 2009 Normal 190000 44.63443
z_Gr.Liv.Area z_Lot.Area
1 0.2874520 -0.06953307
2 0.6415507 -0.09491373
3 1.8442990 -0.31064928
4 0.5782481 -0.14567503
5 1.5594376 0.46016117
6 2.0658580 -0.06052294
3.5 Feature Construction
3.5.1 Unsupervised Binning
Automatic binning determines the number of bins from the distribution of the data: bins are created so that data points within a bin are similar to each other, while data points in different bins are dissimilar.
summary(ames$Overall.Qual)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 5.000 6.000 6.095 7.000 10.000
count(ames, Overall.Qual)
Overall.Qual n
1 1 4
2 2 13
3 3 40
4 4 226
5 5 825
6 6 732
7 7 602
8 8 350
9 9 107
10 10 31
Overall.Qual_Bin <- cut(ames$Overall.Qual, breaks=4)
head(Overall.Qual_Bin)
[1] (5.5,7.75] (5.5,7.75] (3.25,5.5] (3.25,5.5] (5.5,7.75] (5.5,7.75]
Levels: (0.991,3.25] (3.25,5.5] (5.5,7.75] (7.75,10]
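Note that cut() with an integer breaks argument produces equal-width intervals, not equal-frequency ones. If equal-frequency bins are wanted instead, cut at the quantiles; a sketch (the Overall.Qual_QBin name is our own):

```r
# equal-frequency (quartile) bins; include.lowest keeps the minimum value
Overall.Qual_QBin <- cut(ames$Overall.Qual,
                         breaks = quantile(ames$Overall.Qual, probs = seq(0, 1, 0.25)),
                         include.lowest = TRUE)
table(Overall.Qual_QBin)
```

With a discrete variable like Overall.Qual the quartile bins will still be uneven, but the approach generalizes to continuous features.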
3.5.2 Manual binning
We need to know the min and max values. For example, with min = 1 and max = 10, the breaks c(0, 5, 6, 7, 10) below produce the bins (0,5], (5,6], (6,7], and (7,10].
Overall.Qual_MBin <- cut(ames$Overall.Qual, breaks = c(0, 5, 6, 7, 10))
library(lattice) # for barchart()
barchart(Overall.Qual_MBin, main = "Overall Quality Manual Binning")
Sometimes, features will contain levels that have very few observations. For example, there are 28 unique neighborhoods represented in the Ames housing data, but several of them have far fewer observations than the others.
count(train_1, Neighborhood) %>% arrange(n) %>% head() %>% kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center")
| Neighborhood | n |
|---|---|
| Landmrk | 1 |
| GrnHill | 2 |
| Blueste | 8 |
| Greens | 8 |
| NPkVill | 15 |
| Veenker | 21 |
In the above example, we may want to collapse all levels observed in less than 10% of the training sample into an "other" category. We can use step_other() to do so. For example, Screen.Porch has 92% of its values recorded as zero (zero square footage, meaning no screen porch), and the remaining 8% are unique, dispersed values.
library(flextable)
Attaching package: 'flextable'
The following object is masked from 'package:gtsummary':
continuous_summary
The following objects are masked from 'package:kableExtra':
as_image, footnote
arrange(count(train_1, Screen.Porch), n) %>% flextable()
Screen.Porch | n |
|---|---|
40 | 1 |
53 | 1 |
60 | 1 |
63 | 1 |
64 | 1 |
80 | 1 |
88 | 1 |
90 | 1 |
94 | 1 |
95 | 1 |
99 | 1 |
104 | 1 |
109 | 1 |
111 | 1 |
113 | 1 |
117 | 1 |
119 | 1 |
121 | 1 |
122 | 1 |
123 | 1 |
126 | 1 |
128 | 1 |
130 | 1 |
135 | 1 |
141 | 1 |
143 | 1 |
145 | 1 |
150 | 1 |
152 | 1 |
154 | 1 |
162 | 1 |
163 | 1 |
164 | 1 |
165 | 1 |
166 | 1 |
178 | 1 |
184 | 1 |
185 | 1 |
190 | 1 |
197 | 1 |
198 | 1 |
201 | 1 |
217 | 1 |
220 | 1 |
221 | 1 |
222 | 1 |
231 | 1 |
234 | 1 |
252 | 1 |
260 | 1 |
263 | 1 |
264 | 1 |
270 | 1 |
271 | 1 |
276 | 1 |
280 | 1 |
287 | 1 |
291 | 1 |
312 | 1 |
342 | 1 |
348 | 1 |
374 | 1 |
385 | 1 |
396 | 1 |
440 | 1 |
490 | 1 |
576 | 1 |
92 | 2 |
108 | 2 |
116 | 2 |
138 | 2 |
140 | 2 |
170 | 2 |
175 | 2 |
176 | 2 |
182 | 2 |
196 | 2 |
204 | 2 |
227 | 2 |
240 | 2 |
255 | 2 |
259 | 2 |
266 | 2 |
322 | 2 |
100 | 3 |
112 | 3 |
115 | 3 |
142 | 3 |
147 | 3 |
153 | 3 |
156 | 3 |
195 | 3 |
210 | 3 |
225 | 3 |
256 | 3 |
155 | 4 |
189 | 4 |
288 | 4 |
160 | 5 |
180 | 5 |
224 | 5 |
200 | 6 |
216 | 7 |
120 | 8 |
144 | 8 |
168 | 8 |
192 | 9 |
0 | 2,137 |
3.5.3 Standardize and Normalize
# Normalize all numeric columns
recipe(SalePrice ~ ., data = train_1) %>%
step_YeoJohnson(all_numeric())
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs
Number of variables by role
outcome: 1
predictor: 40
── Operations
• Yeo-Johnson transformation on: all_numeric()
##standardize
# train_1 %>%
# step_center(all_numeric(), -all_outcomes()) %>%
# step_scale(all_numeric(), -all_outcomes())
3.5.4 Lump levels for two features
# lumping <- recipe(SalePrice ~ ., data = train_1) %>%
# step_other(Neighborhood, threshold = 0.01,
# other = "other") %>%
# step_other(Screen.Porch, threshold = 0.1,
# other = ">0")
lumping <- recipe(SalePrice ~ ., data = train_1) %>%
step_mutate(Screen.Porch = factor(ifelse(Screen.Porch > 0, ">0", "0"))) %>%
step_other(Neighborhood, threshold = 0.01, other = "other") %>%
step_other(Screen.Porch, threshold = 0.1, other = ">0")
3.5.4.1 Modified Code
recipe(SalePrice ~ ., data = train_1): the initialization is unchanged, with SalePrice as the outcome.
step_mutate(Screen.Porch = factor(ifelse(Screen.Porch > 0, ">0", "0"))): this new step converts the numeric Screen.Porch variable into a binary categorical variable; values greater than 0 are labeled ">0", otherwise "0".
step_other(Neighborhood, threshold = 0.01, other = "other"): this step is unchanged from the original code.
step_other(Screen.Porch, threshold = 0.1, other = ">0"): similar to the original code, but applied to the newly created binary categorical Screen.Porch.
3.5.5 Apply this blueprint
apply_2_training <- prep(lumping, training = train_1) %>%
bake(train_1)
3.5.6 New distribution of Neighborhood
count(apply_2_training, Neighborhood) %>% arrange(n) %>% flextable()
Neighborhood | n |
|---|---|
BrDale | 27 |
MeadowV | 34 |
ClearCr | 36 |
StoneBr | 39 |
SWISU | 41 |
NoRidge | 54 |
Timber | 58 |
IDOTRR | 74 |
other | 77 |
Crawfor | 88 |
BrkSide | 89 |
Mitchel | 99 |
NWAmes | 103 |
SawyerW | 103 |
Sawyer | 126 |
NridgHt | 130 |
Gilbert | 133 |
Edwards | 141 |
Somerst | 142 |
OldTown | 183 |
CollgCr | 213 |
NAmes | 354 |
3.5.7 New distribution of Screen.Porch
count(apply_2_training, Screen.Porch) %>% arrange(n) %>% flextable()
Screen.Porch | n |
|---|---|
>0 | 207 |
0 | 2,137 |
3.5.8 Variable Encoding
Categorical variables are very common in data analysis and machine learning (ML). Data that can be classified into categories or groups, such as colors or job titles, is called categorical data. Categorical variables must be encoded into numerical values before they can be used in statistical analysis or ML models.
Warning: the dummies package is no longer available on CRAN.
3.5.8.1 Using fastDummies package
We can create dummy variables for all categorical variables in the dataset.
library(fastDummies)
ames_train_dummies <- dummy_cols(train_1, select_columns = NULL, remove_first_dummy = TRUE)
ames_test_dummies <- dummy_cols(test_1, select_columns = NULL, remove_first_dummy = TRUE)
3.5.8.2 One-hot encoding (caret)
# Create dummy variables transformation
dummies <- dummyVars(SalePrice ~ ., data = ames)
# Apply transformation to dataset
ames_dummies <- data.frame(predict(dummies, newdata = ames))3.5.8.3 Label encoding (caret)
# Identify columns with class 'factor'
factor_columns <- sapply(ames, is.factor)
# Apply label encoding on those columns
ames[factor_columns] <- lapply(ames[factor_columns], function(x) as.numeric(as.factor(x)))
Here we only had one class variable; with multiple class variables, convert them all into factors first, then apply the R code above to encode the various categories simultaneously.
3.6 Missing Values
3.6.1 Missing data visualization
Let's look at all missingness in the Ames dataset.
library(visdat)
vis_miss(AmesHousing::ames_raw, cluster = TRUE)
Missing value imputation: for categorical variables where fewer than 40% of the values are null / missing, missing values are replaced with the mode, i.e., the value whose frequency is maximum.
Mode = function(x){
ta = table(x)
tam = max(ta)
mod = names(ta)[ta==tam]
return(mod)
}
Now identify the factor variables and replace any missing / null values with the mode.
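A quick check of the Mode() helper defined above (a sketch with made-up vectors); note that it returns the level names as character, and that ties return every modal value:

```r
Mode(c("a", "b", "a", "c"))  # "a"
Mode(c(1, 1, 2, 2))          # "1" "2"  (a tie: both modes, as character names)
```

When using it for imputation, be aware that a tie yields a vector of length > 1, which the replacement loop below would recycle.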
for (col_name in colnames(ames[sapply(ames, is.factor) == TRUE])) {
if (sum(is.na(ames[[col_name]])) > 0) {
ames[col_name][is.na(ames[col_name])] <- Mode(ames[[col_name]])
stmt <- paste('Null values of', col_name, ' feature has been replaced by mode value ',
Mode(ames[[col_name]]))
print(stmt, na.print = NULL)
}
}
\begin{align} \log(SalePrice) = \beta_{0}+\beta_{1}X_{1}\\ SalePrice = e^{\beta_{0}+\beta_{1}X_{1}}\\ SalePrice = B e^{\beta_{1}X_{1}}, \quad B = e^{\beta_{0}} \end{align}